fix: lead responsiveness (error suppression, circuit breaker, routing, auth)#2328
fix: lead responsiveness (error suppression, circuit breaker, routing, auth)#2328
Conversation
When a Claude session exits with an error (e.g. expired OAuth token), the error text was being auto-posted to the channel as if it were normal assistant output. This caused raw JSON API errors to appear in user-facing channels. Now, when flush_auto_output detects a Result event with is_error: true, it clears the pending events instead of posting them. Errors are still logged to daemon.log for debugging. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2328 +/- ##
==========================================
- Coverage 64.00% 63.93% -0.07%
==========================================
Files 100 100
Lines 38588 38578 -10
==========================================
- Hits 24699 24666 -33
- Misses 13889 13912 +23
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7292e8777a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| let has_error_result = events.iter().any(|e| { | ||
| matches!( | ||
| e, | ||
| crate::headless::StreamEvent::Result { is_error: true, .. } | ||
| ) |
There was a problem hiding this comment.
Defer error suppression until turn end
flush_auto_output suppresses only when the current events batch already contains StreamEvent::Result { is_error: true }, but drain_session_output also flushes on the 2-second timer before the result event arrives. In error cases where an assistant error message is emitted first and the timer fires before the Result, this branch is bypassed and the raw error text is posted; the later error result then clears an empty buffer. That means the new “suppress error output” behavior is still violated under this timing.
Useful? React with 👍 / 👎.
Channel leads had no max retry limit — when a lead's auth expired, it would crash, wait 120s, crash again, and repeat forever. Workers already had MAX_WORKER_RESTARTS=3 with ops escalation, but leads were missing this protection. Now ensure_lead_for_channel checks failure_count and stops retrying after MAX_LEAD_RESTARTS=3 consecutive failures, posting to ops instead. The daemon also uses the correct per-kind max and escalation key format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Workers are spawned with channel=dm-{name}, not the task's channel.
This meant @ALL in a topic channel only found agents indexed under
that channel (the lead), missing workers whose tasks belong there.
The fix adds a second scan after the by_channel lookup that finds
workers by checking task.channel instead of agent.channel. The nudge
dedup set prevents double-nudging workers that happen to be indexed
in the correct channel.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resolve_nudge_action returned RespawnAndDeliver for any stopped agent with no session_id, regardless of whether a SpawnFailure cooldown was active. This meant every user message to a dead lead triggered a new spawn attempt, bypassing both the 120s cooldown and the circuit breaker (MAX_LEAD_RESTARTS) from ensure_channel_leads_alive. Now resolve_nudge_action checks is_active(SpawnFailure, key) before returning RespawnAndDeliver. Leads use channel as key, workers use task_id. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Cycle 4 fix only added cooldown checks to RespawnAndDeliver (no session_id). ResumeAndDeliver (has session_id) was unchecked, meaning agents with preserved session_ids (e.g. auth errors that don't trigger "No conversation found") could be resumed on every user message. Moved the cooldown check before the session_id branch so it applies to both resume and respawn paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The web UI's Re-authenticate button triggered an OAuth code flow (BROWSER=false, capture URL, paste code) that didn't work in practice. The only reliable auth method is running `claude auth login` directly, which opens the browser natively for OAuth. Changes: - Backend: simplified auth_login route to run `claude auth login` without BROWSER=false, waits up to 5 min for completion, then restarts all agents - Removed: /api/auth/login/code endpoint, AuthLoginProcess struct, pending_auth_login state - UI: replaced code-paste flow with simple Login button that shows "Logging in" state while waiting for browser auth to complete Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses Codex review: flush_auto_output's error check only worked
when the Result event was in the same batch as the error text. If the
2-second timer fired between the error text and the Result, the text
leaked through.
Moved error suppression from flush_auto_output to the drain loop:
- Set session_errored flag when Result { is_error: true } is received
- Skip timer flushes, stream-end flushes, and turn-end flushes when
the flag is set
- flush_auto_output is now a pure extract-and-post function
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Addressed Codex review in bf69bb3: P1 — Defer error suppression until turn end: Moved the error check from |
Pre-existing biome error exposed by touching Channel.svelte. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-existing biome noNonNullAssertion error in store.test.ts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Biome noNonNullAssertion errors — replace ! with optional chaining or nullish coalescing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Downgrade noNonNullAssertion to warn for .svelte files (reactive code uses $store! assertions extensively and they are valid at runtime) - Apply biome auto-fix for Channel.svelte optional chaining - Apply biome format fix for api.ts startAuthLogin signature Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AccountPanel.svelte also had the old startAuthLogin + submitAuthCode flow. Updated to use the simplified direct-login approach (matching Channel.svelte changes). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Six fixes for channel lead responsiveness, found via dogfood testing:
1. Suppress error output from auto-posting to channels
2. Add circuit breaker for lead spawn failures
MAX_LEAD_RESTARTS=3, escalates to ops3. Topic channel @ALL now finds workers in dm channels
dm-{name}were invisible to channel-scoped@all4+5. Nudge resume/respawn respects spawn failure cooldowns
ResumeAndDeliverandRespawnAndDelivernow check cooldowns6. Replace broken OAuth code flow with direct
claude auth loginclaude auth logindirectly, opens native browser/api/auth/login/code,AuthLoginProcess,pending_auth_loginSpec updates: v2-spec.md §1.4, §4.1, §4.4
Test plan
flush_auto_output_suppresses_error_results/_posts_normal_resultslead_gives_up_after_max_spawn_failuresat_all_in_topic_finds_workers_in_dm_channelsstopped_lead_with_active_cooldown_resolves_to_dropstopped_worker_with_active_cooldown_resolves_to_dropstopped_lead_with_session_id_and_active_cooldown_resolves_to_dropcargo clippy+cargo fmtclean🤖 Generated with Claude Code